OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles

Authors

  • Pierre Lison
  • Jörg Tiedemann
Abstract

We present a new major release of the OpenSubtitles collection of parallel corpora. The release is compiled from a large database of movie and TV subtitles and includes a total of 1689 bitexts spanning 2.6 billion sentences across 60 languages. The release also incorporates a number of enhancements in the preprocessing and alignment of the subtitles, such as the automatic correction of OCR errors and the use of meta-data to estimate the quality of each subtitle and score subtitle pairs.

Related resources

Dual Subtitles as Parallel Corpora

In this paper, we leverage the existence of dual subtitles as a source of parallel data. Dual subtitles present viewers with two languages simultaneously, and are generally aligned at the segment level, which removes the need to automatically perform this alignment. This is desirable as extracted parallel data does not contain alignment errors present in previous work that aligns different subt...

Constructing Parallel Corpus from Movie Subtitles

This paper describes a methodology for constructing aligned German-Chinese corpora from movie subtitles. The corpora will be used to train a special machine translation system with intention to automatically translate the subtitles between German and Chinese. Since the common length-based algorithm for alignment shows weakness on short spoken sentences, especially on those from different langua...

Using Movie Subtitles for Creating a Large-Scale Bilingual Corpora

This paper presents a method for compiling a large-scale bilingual corpus from a database of movie subtitles. To create the corpus, we propose an algorithm based on Gale and Church’s sentence alignment algorithm (1993). However, our algorithm not only relies on character length information, but also uses subtitle-timing information, which is encoded in the subtitle files. Timing is highly correl...

Not All Dialogues are Created Equal: Instance Weighting for Neural Conversational Models

Neural conversational models require substantial amounts of dialogue data to estimate their parameters and are therefore usually learned on large corpora such as chat forums, Twitter discussions or movie subtitles. These corpora are, however, often challenging to work with, notably due to their frequent lack of turn segmentation and the presence of multiple references external to the dialogue i...

THE EFFECT OF STANDARD AND REVERSED SUBTITLING VERSUS NO SUBTITLING MODE ON L2 VOCABULARY LEARNING

Audiovisual material accompanied by interlingual subtitles is a powerful pedagogical tool which can help improve the vocabulary learning of second-language learners. This study was intended to determine whether or not the mode (standard and reversed) of subtitling affects the incidental vocabulary acquisition of Iranian L2 learners while watching TV programs. Forty-five participants were random...

Publication date: 2016